Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

approach can speed up the multiple sequence comparison dramatically [...] sequences.

Figure 7.14. (a) The CPU time comparison between the alignment-based approach using the Needleman-Wunsch algorithm and the alignment-free approach using the k-mer word frequency library for sequence comparison. The horizontal axis stands for the sequence length. (b) The accuracy comparison between the alignment-based approach using the Needleman-Wunsch algorithm and the alignment-free approach, i.e., the k-mer word frequency library approach, for sequence comparison. ‘rho’ stands for the correlation coefficient and ‘p’ is the correlation test p value.
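The two kinds of distance compared in the figure can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the book's own code. The function names (`kmer_frequencies`, `kmer_distance`, `nw_distance`), the word length k = 3, the Euclidean metric and the unit mismatch and gap costs are all assumptions made here. The Needleman-Wunsch recursion is shown in its cost-minimising (distance) form rather than the more common score-maximising form.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every possible k-mer word in seq."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # number of overlapping words
    return [counts[w] / total for w in words]


def kmer_distance(a, b, k=3):
    """Alignment-free distance: Euclidean distance between the two
    k-mer frequency vectors (the metric is an assumption of this sketch)."""
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def nw_distance(a, b, mismatch=1, gap=1):
    """Alignment-based distance: Needleman-Wunsch global alignment by
    dynamic programming, minimising the total mismatch and gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


a = "ACGTACGTAC"
b = "ACGTTCGTAC"  # differs from a by a single substitution
print(nw_distance(a, b))    # alignment-based distance
print(kmer_distance(a, b))  # alignment-free distance
```

The dynamic programming table needs work proportional to the product of the two sequence lengths, whereas the k-mer counts are collected in a single pass over each sequence. That difference is one plausible reason for the CPU time gap in panel (a).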

The accuracy comparison

Whether the alignment-free approach is accurate was also examined. The aim was to investigate whether the distance between sequences measured by the alignment-free approach was correlated with the alignment distance of the alignment-based approach. One hundred pairs of pseudo nucleotide sequences were randomly generated. The mutation rate ranged between 0.5% and 5%. Every mutation rate was repeated 500 times; therefore, there were 50,000 trials in total. Both the alignment-based approach and the alignment-free approach were applied to each pair of random pseudo nucleotide sequences. First, the distance percentage was defined as the ratio of the alignment distance over the alignment [...]. The correlation between the distance percentage and the distance between k-mer word frequencies was tested. Figure 7.14(b) shows the result. It can be seen that the correlation coefficient was greater than 0.927 and